Data

First, we will read in the MISR datasets which have been matched to the AQS and CSN datasets. These data were matched spatially by pairing each MISR data pixel with every AQS/CSN data collection site within a 2.2 km radius of it, and these matches were then filtered by the dates on which the observations were recorded.
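The matching step described above can be sketched as follows. This is a minimal illustration, assuming that great-circle (haversine) distance is an acceptable stand-in for however the 2.2 km radius was actually computed and that "matching on dates" means identical dates. The function names, the Earth-radius constant, and the record layout (`lat`, `lon`, `date` keys) are hypothetical and do not come from the original analysis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def match_sites_to_pixel(pixel, sites, radius_km=2.2):
    """Return the sites within radius_km of the pixel that share its date (assumed rule)."""
    return [
        s for s in sites
        if s["date"] == pixel["date"]
        and haversine_km(pixel["lat"], pixel["lon"], s["lat"], s["lon"]) <= radius_km
    ]


# Toy records, invented for illustration only.
pixel = {"lat": 34.00, "lon": -118.00, "date": "2005-06-01"}
sites = [
    {"id": "A", "lat": 34.01, "lon": -118.00, "date": "2005-06-01"},  # about 1.1 km away
    {"id": "B", "lat": 34.05, "lon": -118.00, "date": "2005-06-01"},  # about 5.6 km away
    {"id": "C", "lat": 34.01, "lon": -118.00, "date": "2005-06-02"},  # close, wrong date
]
matched = match_sites_to_pixel(pixel, sites)
```

Only site A survives both filters in this toy example: B is outside the radius and C was recorded on a different date.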

We will also change the way that dates are stored in these datasets. Instead of storing each date as one object in a YYYY-MM-DD format, we will store the day, month, and year as three separate attributes.
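As a small illustration of this date change, the sketch below splits one YYYY-MM-DD string into three attributes. The helper name `split_date` is hypothetical; the keys follow the Year, Month, and Day column names that appear in the merged dataset later in this report.

```python
from datetime import date


def split_date(iso_string):
    """Split a 'YYYY-MM-DD' string into separate Year, Month, and Day attributes."""
    d = date.fromisoformat(iso_string)
    return {"Year": d.year, "Month": d.month, "Day": d.day}


parts = split_date("2000-03-15")
```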

In addition to the data which we collected from the CSN dataset, we will also use a formula to estimate the total dust mass in a given area, based on the presence of certain elements.

The formula to compute dust mass is given by \(\text{Dust Mass} = 2.2\times\text{Al} + 2.49\times\text{Si} + 1.63\times\text{Ca} + 1.94\times\text{Ti} + 2.42\times\text{Fe}\).
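The formula above translates directly into code. The sketch below is illustrative only: the function and dictionary names are hypothetical, and returning `None` when any of the five elements is missing is an assumption about how missing measurements might be handled, not something the report specifies.

```python
# Coefficients taken from the dust mass formula above.
DUST_COEFFICIENTS = {"Al": 2.2, "Si": 2.49, "Ca": 1.63, "Ti": 1.94, "Fe": 2.42}


def dust_mass(concentrations):
    """Weighted sum of Al, Si, Ca, Ti, and Fe concentrations.

    Returns None if any element is missing (an assumed convention).
    """
    if any(concentrations.get(element) is None for element in DUST_COEFFICIENTS):
        return None
    return sum(coef * concentrations[element] for element, coef in DUST_COEFFICIENTS.items())


# Toy input: one unit of each element gives the sum of the coefficients.
example = dust_mass({"Al": 1.0, "Si": 1.0, "Ca": 1.0, "Ti": 1.0, "Fe": 1.0})
```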

Exploratory Data Analysis

First, we will do some exploratory data analysis for these datasets, so we can have a better understanding of the data which we collected.

Numerical Summaries

We will examine some brief numerical summaries of our main four variables, in order to know more about their general distributions.

Numerical Summaries of the variables we want to predict
| Variable | Minimum | 25th percentile | Median | 75th percentile | Maximum | IQR | Range | Mean | Standard Deviation | Present Values | Missing Values |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dust Mass | -0.0284 | 0.3503 | 0.6126 | 1.0773 | 18.3301 | 0.7270 | 18.3584 | 0.9065 | 1.0361 | 5115 | 174 |
| Nitrate | 0.0000 | 0.6300 | 1.4000 | 3.4400 | 53.9000 | 2.8100 | 53.9000 | 3.0576 | 4.6030 | 5073 | 216 |
| Sulfate | 0.0000 | 0.5640 | 0.9943 | 1.6800 | 10.7000 | 1.1160 | 10.7000 | 1.3069 | 1.1535 | 5094 | 195 |
| PM2.5 | -7.2000 | 5.0000 | 8.0833 | 12.4583 | 529.4167 | 7.4583 | 536.6167 | 10.4596 | 10.6680 | 157005 | 0 |
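The columns of this summary table can be computed for a single variable with a sketch like the one below. It is a hedged illustration: the function name is hypothetical, missing values are represented as `None`, and the `"inclusive"` quantile method is an assumed choice, since the report does not state which quantile definition was used.

```python
import statistics


def summarize(values):
    """Summary statistics for one variable; None entries count as missing."""
    present = sorted(v for v in values if v is not None)
    # Quartile definition is an assumption; other definitions give slightly different values.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "Minimum": present[0],
        "25th percentile": q1,
        "Median": median,
        "75th percentile": q3,
        "Maximum": present[-1],
        "IQR": q3 - q1,
        "Range": present[-1] - present[0],
        "Mean": statistics.fmean(present),
        "Standard Deviation": statistics.stdev(present),  # sample standard deviation
        "Present Values": len(present),
        "Missing Values": len(values) - len(present),
    }


# Toy data, invented for illustration.
summary = summarize([1, 2, 3, 4, 5, None])
```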

Based on the table above, we see that there are a few negative values recorded for Dust Mass and PM2.5 concentrations. As a concentration must be non-negative (we cannot have a negative amount of a particle), we will replace all negative values with 0.
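Replacing the negative values amounts to clipping each value at zero. A minimal sketch follows; the function name is hypothetical, and leaving missing values (`None`) untouched is an assumption.

```python
def clip_negative(values):
    """Replace negative values with 0, leaving missing (None) entries as they are."""
    return [None if v is None else max(v, 0.0) for v in values]


# Toy values, including the most negative PM2.5 reading from the table above.
cleaned = clip_negative([-7.2, 0.0, 5.0, None])
```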

Histograms

In addition to examining numerical summaries of these values, we will also examine histograms of the values to see the overall distributions of these variables in a more visual manner.

From the plots above, we see that the distributions of all four variables are somewhat right-skewed: there are quite a few high-valued outliers in these datasets, and there is no corresponding number of low values, as these data are all non-negative.

The log-plots above all appear to be relatively symmetrical and look somewhat like Normal distributions, which may be helpful for model fitting and prediction purposes, as these distributions are significantly less skewed by their few large values.

Historical Data

In addition to the histograms, which show the general distributions of these data over our 22-year period, we have created time series plots of dust mass, nitrate, PM2.5, and sulfate concentrations in California to visualize how these quantities have changed over time.

The PM2.5 data are sourced from the AQS data collection sites, whereas the dust mass, nitrate, and sulfate concentrations come from the CSN datasets.

[Figures: time series plots titled "Monthly Dust Mass Concentrations in California", "Monthly Nitrate Concentrations in California", "Monthly PM2.5 Concentrations in California", and "Monthly Dust Mass Concentrations in California"]

Finding Missing Values

To start off, we will examine counts of missing values in our datasets, to determine how much of the data which we aim to use is actually present in the dataset.

Counts of Variables in the merged MISR and AQS dataset
Variable Name Recorded Values Missing Values
PM25 157005 0
Year 157005 0
Month 157005 0
Day 157005 0
Site.Latitude 157005 0
Site.Longitude 157005 0
elevation 157005 0
pixel.latitude 157005 0
pixel.longitude 157005 0
AOD 50089 106916
AOD_uncertainty 50089 106916
angstrom_exp_550_860 50089 106916
AOD_absorption 50089 106916
AOD_nonspherical 50089 106916
small_mode_AOD 50089 106916
medium_mode_AOD 50089 106916
large_mode_AOD 50089 106916
aod_mix_01 57964 99041
aod_mix_02 58151 98854
aod_mix_03 58387 98618
aod_mix_04 58675 98330
aod_mix_05 58870 98135
aod_mix_06 59079 97926
aod_mix_07 59220 97785
aod_mix_08 59105 97900
aod_mix_09 58027 98978
aod_mix_10 54411 102594
aod_mix_11 63009 93996
aod_mix_12 62969 94036
aod_mix_13 62990 94015
aod_mix_14 62899 94106
aod_mix_15 62332 94673
aod_mix_16 61316 95689
aod_mix_17 59339 97666
aod_mix_18 56133 100872
aod_mix_19 51824 105181
aod_mix_20 46472 110533
aod_mix_21 49385 107620
aod_mix_22 49028 107977
aod_mix_23 48577 108428
aod_mix_24 47605 109400
aod_mix_25 46377 110628
aod_mix_26 45070 111935
aod_mix_27 43693 113312
aod_mix_28 42072 114933
aod_mix_29 40508 116497
aod_mix_30 38868 118137
aod_mix_31 61811 95194
aod_mix_32 61716 95289
aod_mix_33 61529 95476
aod_mix_34 60895 96110
aod_mix_35 60081 96924
aod_mix_36 58316 98689
aod_mix_37 55627 101378
aod_mix_38 52266 104739
aod_mix_39 48168 108837
aod_mix_40 43772 113233
aod_mix_41 53302 103703
aod_mix_42 53203 103802
aod_mix_43 53113 103892
aod_mix_44 52600 104405
aod_mix_45 51719 105286
aod_mix_46 50327 106678
aod_mix_47 48476 108529
aod_mix_48 46027 110978
aod_mix_49 43313 113692
aod_mix_50 40558 116447
aod_mix_51 60791 96214
aod_mix_52 54792 102213
aod_mix_53 38906 118099
aod_mix_54 51407 105598
aod_mix_55 43584 113421
aod_mix_56 33043 123962
aod_mix_57 38243 118762
aod_mix_58 33797 123208
aod_mix_59 29758 127247
aod_mix_60 29972 127033
aod_mix_61 29164 127841
aod_mix_62 28486 128519
aod_mix_63 38842 118163
aod_mix_64 37758 119247
aod_mix_65 36746 120259
aod_mix_66 35872 121133
aod_mix_67 29471 127534
aod_mix_68 28945 128060
aod_mix_69 28730 128275
aod_mix_70 28666 128339
aod_mix_71 28061 128944
aod_mix_72 28061 128944
aod_mix_73 28068 128937
aod_mix_74 28088 128917

First, we notice that there are no missing values for the date variables (Year, Month, and Day) or for PM2.5, which is excellent, as these are arguably our most important variables.

We can also notice that each of the 8 AOD variables has the same number of recorded values (50089) and missing values (106916). If we examine these 8 variables further, we find that they are a “package deal”; for each observation, there is either a recorded value for all 8 of these variables, or a missing value for all 8 variables.

Unfortunately, the same cannot be said for the 74 AOD mixture variables. From the table above, we can clearly see that the number of available observations varies across the 74 mixtures. The mixtures with the fewest recorded observations (aod_mix_71 and aod_mix_72) each have 28061 recorded values. Furthermore, 20604 observations have a recorded value for every one of the 74 mixtures, which is a fair amount of data to work with.
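Both checks described in this section, the "package deal" property and the count of observations with every mixture recorded, can be sketched as below. The function names and the row layout (one dictionary per observation, with `None` for a missing value) are hypothetical, and the toy rows are invented for illustration.

```python
def all_or_nothing(rows, columns):
    """True if every row has either all of the columns recorded or none of them."""
    for row in rows:
        n_present = sum(row.get(c) is not None for c in columns)
        if n_present not in (0, len(columns)):
            return False
    return True


def count_complete_rows(rows, columns):
    """Number of rows with a recorded value in every one of the columns."""
    return sum(all(row.get(c) is not None for c in columns) for row in rows)


# Toy rows with two AOD variables and two mixtures.
rows = [
    {"AOD": 0.1, "AOD_uncertainty": 0.02, "aod_mix_01": 0.1, "aod_mix_02": 0.1},
    {"AOD": None, "AOD_uncertainty": None, "aod_mix_01": 0.2, "aod_mix_02": None},
    {"AOD": 0.3, "AOD_uncertainty": 0.05, "aod_mix_01": None, "aod_mix_02": None},
]
aod_is_package = all_or_nothing(rows, ["AOD", "AOD_uncertainty"])
mix_is_package = all_or_nothing(rows, ["aod_mix_01", "aod_mix_02"])
n_complete_mix = count_complete_rows(rows, ["aod_mix_01", "aod_mix_02"])
```

In the toy data the AOD columns are all-or-nothing but the mixture columns are not, which mirrors the pattern reported for the real dataset.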

Charts and Graphs

Next, we will create some charts and plots of the matched MISR data, to get visual representations of the data which we have collected.

First, we will create a “correlation heatmap” to visually depict the correlations between the 74 AOD mixtures which were collected in the MISR data. In the correlation heatmap shown below, the correlations between these different mixtures are measured from -1 to 1, and each square in the heatmap is coloured in, with its colour and intensity proportional to the correlation between the two variables.

Correlation Heatmap for the 74 MISR Mixtures

As we can clearly see in the correlation heatmap displayed above, the 74 AOD mixtures in the collected MISR data are all strongly correlated with one another, as the entire heatmap is green.

In fact, the weakest correlation between any pair of these 74 AOD mixtures is 0.681 (between aod_mix_01 and aod_mix_44), which is still generally considered a strong positive linear relationship.
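Finding the weakest pairwise correlation can be sketched as follows. This assumes the Pearson correlation coefficient and columns of equal length with no missing values; how missing values were handled when the heatmap was computed is not stated in the report. The function names and toy columns are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def weakest_pair(columns):
    """Return (name_a, name_b, r) for the pair of columns with the smallest correlation."""
    pairs = ((a, b, pearson(columns[a], columns[b])) for a, b in combinations(columns, 2))
    return min(pairs, key=lambda item: item[2])


# Toy columns, invented for illustration.
toy = {
    "mix_a": [1, 2, 3, 4],
    "mix_b": [2, 4, 6, 8],  # perfectly correlated with mix_a
    "mix_c": [1, 3, 2, 4],  # correlation 0.8 with both others
}
name_1, name_2, r_min = weakest_pair(toy)
```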

Model Fitting

Next, we will test a variety of model fitting techniques on our dataset in order to determine which models make better predictions for our data.

We will create a whole host of different models, as we have multiple different values in these two datasets which we want to predict, and there are multiple different sets of predictors which we aim to incorporate.

The 6 main values which we want to predict are: PM2.5, \(\text{SO}_{4}^{2-}\) (sulfate), \(\text{NO}_{3}^{-}\) (nitrate), dust mass, elemental carbon, and organic carbon. The two primary sets of predictors which we want to use are the 8 measured AOD values and the 74 MISR AOD mixtures.

In addition to these two sets of predictors mentioned above, we will also introduce a “Months” variable to help account for the changes in these values over time. The Months variable will be computed by determining how many months it has been since March 2000, which represents the beginning of our collected data.
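One way to compute the Months variable is sketched below. The function name is hypothetical, and counting March 2000 itself as month 0 (rather than 1) is an assumption, since the report does not say which convention was used.

```python
def months_since_start(year, month, start_year=2000, start_month=3):
    """Whole months elapsed since the start month (March 2000 by default, counted as 0)."""
    return (year - start_year) * 12 + (month - start_month)


first = months_since_start(2000, 3)
one_year_later = months_since_start(2001, 3)
```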

First, we will remove all rows with missing values for these desired predictors, and then we will split both of our datasets into a training dataset, a validation dataset, and a test dataset, with a 70/15/15 split for the training, validation, and test datasets, respectively.
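The two preparation steps above, dropping incomplete rows and then splitting 70/15/15, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the split is a simple random shuffle (the report does not say whether the split was random, stratified, or ordered in time), and the seed is arbitrary.

```python
import random


def drop_incomplete(rows, predictors):
    """Keep only the rows that have a recorded value for every predictor."""
    return [r for r in rows if all(r.get(p) is not None for p in predictors)]


def train_val_test_split(rows, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle the rows and cut them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_train = round(train_frac * len(shuffled))
    n_val = round(val_frac * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains, about 15%
    return train, val, test


# Toy data: 100 row identifiers standing in for observations.
toy_rows = list(range(100))
train, val, test = train_val_test_split(toy_rows)
```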

PM2.5 AOD XGBoost

Model Performance of xgboost models on the validation dataset
Model: Predicting PM2.5 using AOD
nrounds eta max_depth gamma colsample_bytree min_child_weight subsample RMSE R2
100 0.1 10 0.01 0.50 0 0.50 5.347618 0.6397262
100 0.3 10 0.01 0.50 0 0.50 5.256340 0.6509654
100 0.6 10 0.01 0.50 0 0.50 5.782292 0.5971254
100 1.0 10 0.01 0.50 0 0.50 7.522182 0.4657776
100 0.1 10 0.01 0.75 0 0.50 5.286397 0.6478273
100 0.3 10 0.01 0.75 0 0.50 5.445571 0.6273246
100 0.6 10 0.01 0.75 0 0.50 5.772512 0.6019514
100 1.0 10 0.01 0.75 0 0.50 7.472564 0.4723911
100 0.1 10 0.01 1.00 0 0.50 5.349382 0.6379712
100 0.3 10 0.01 1.00 0 0.50 5.276000 0.6492055
100 0.6 10 0.01 1.00 0 0.50 5.985256 0.5755555
100 1.0 10 0.01 1.00 0 0.50 7.847674 0.4344741
100 0.1 10 0.01 0.50 1 0.50 5.347453 0.6400919
100 0.3 10 0.01 0.50 1 0.50 5.290056 0.6466636
100 0.6 10 0.01 0.50 1 0.50 5.989013 0.5785644
100 1.0 10 0.01 0.50 1 0.50 7.477596 0.4606548
100 0.1 10 0.01 0.75 1 0.50 5.444863 0.6252738
100 0.3 10 0.01 0.75 1 0.50 5.107947 0.6698592
100 0.6 10 0.01 0.75 1 0.50 5.732420 0.6059978
100 1.0 10 0.01 0.75 1 0.50 7.933537 0.4247203
100 0.1 10 0.01 1.00 1 0.50 5.327757 0.6414804
100 0.3 10 0.01 1.00 1 0.50 5.212554 0.6562185
100 0.6 10 0.01 1.00 1 0.50 5.947852 0.5815995
100 1.0 10 0.01 1.00 1 0.50 7.515620 0.4528092
100 0.1 10 0.01 0.50 0 0.75 5.196092 0.6604474
100 0.3 10 0.01 0.50 0 0.75 5.028545 0.6793359
100 0.6 10 0.01 0.50 0 0.75 5.343924 0.6475261
100 1.0 10 0.01 0.50 0 0.75 5.934369 0.5913790
100 0.1 10 0.01 0.75 0 0.75 5.187687 0.6616433
100 0.3 10 0.01 0.75 0 0.75 5.077847 0.6734901
100 0.6 10 0.01 0.75 0 0.75 5.389861 0.6417424
100 1.0 10 0.01 0.75 0 0.75 6.063941 0.5754576
100 0.1 10 0.01 1.00 0 0.75 5.168003 0.6621746
100 0.3 10 0.01 1.00 0 0.75 5.098299 0.6715291
100 0.6 10 0.01 1.00 0 0.75 5.301033 0.6548736
100 1.0 10 0.01 1.00 0 0.75 6.244188 0.5646597
100 0.1 10 0.01 0.50 1 0.75 5.136201 0.6696021
100 0.3 10 0.01 0.50 1 0.75 5.035703 0.6784534
100 0.6 10 0.01 0.50 1 0.75 5.292507 0.6525128
100 1.0 10 0.01 0.50 1 0.75 6.073099 0.5811894
100 0.1 10 0.01 0.75 1 0.75 5.138485 0.6678540
100 0.3 10 0.01 0.75 1 0.75 4.970264 0.6867858
100 0.6 10 0.01 0.75 1 0.75 5.367675 0.6438022
100 1.0 10 0.01 0.75 1 0.75 6.325604 0.5515490
100 0.1 10 0.01 1.00 1 0.75 5.187059 0.6599201
100 0.3 10 0.01 1.00 1 0.75 4.992194 0.6847457
100 0.6 10 0.01 1.00 1 0.75 5.362777 0.6461933
100 1.0 10 0.01 1.00 1 0.75 6.058154 0.5820437
100 0.1 10 0.01 0.50 0 1.00 5.129359 0.6701218
100 0.3 10 0.01 0.50 0 1.00 4.918802 0.6931672
100 0.6 10 0.01 0.50 0 1.00 5.200005 0.6608385
100 1.0 10 0.01 0.50 0 1.00 5.891995 0.5952795
100 0.1 10 0.01 0.75 0 1.00 5.091848 0.6728921
100 0.3 10 0.01 0.75 0 1.00 4.972647 0.6865903
100 0.6 10 0.01 0.75 0 1.00 5.263864 0.6556700
100 1.0 10 0.01 0.75 0 1.00 5.825369 0.6006026
100 0.1 10 0.01 1.00 0 1.00 5.154637 0.6641255
100 0.3 10 0.01 1.00 0 1.00 5.009561 0.6822507
100 0.6 10 0.01 1.00 0 1.00 5.234068 0.6611123
100 1.0 10 0.01 1.00 0 1.00 5.865548 0.6013663
100 0.1 10 0.01 0.50 1 1.00 5.217512 0.6580023
100 0.3 10 0.01 0.50 1 1.00 5.072691 0.6736639
100 0.6 10 0.01 0.50 1 1.00 5.096428 0.6735823
100 1.0 10 0.01 0.50 1 1.00 6.022372 0.5812974
100 0.1 10 0.01 0.75 1 1.00 5.145960 0.6657996
100 0.3 10 0.01 0.75 1 1.00 4.914138 0.6937481
100 0.6 10 0.01 0.75 1 1.00 5.052396 0.6806699
100 1.0 10 0.01 0.75 1 1.00 6.138499 0.5685186
100 0.1 10 0.01 1.00 1 1.00 5.154637 0.6641255
100 0.3 10 0.01 1.00 1 1.00 5.009561 0.6822507
100 0.6 10 0.01 1.00 1 1.00 5.234068 0.6611123
100 1.0 10 0.01 1.00 1 1.00 5.865548 0.6013663
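The 72 rows of this table correspond to a full grid over four tuning parameters, with nrounds, max_depth, and gamma held fixed. The sketch below rebuilds that grid in the same row order as the table and shows how the best row could be picked by validation RMSE. The names `build_grid` and `best_by_rmse` are hypothetical, and no model fitting is performed here; only the parameter values are read from the table.

```python
from itertools import product

# Values read from the table above.
FIXED = {"nrounds": 100, "max_depth": 10, "gamma": 0.01}
ETA = [0.1, 0.3, 0.6, 1.0]
COLSAMPLE_BYTREE = [0.50, 0.75, 1.00]
MIN_CHILD_WEIGHT = [0, 1]
SUBSAMPLE = [0.50, 0.75, 1.00]


def build_grid():
    """All parameter combinations, ordered as in the table (eta varies fastest)."""
    grid = []
    # product() varies its last argument fastest, so the slowest-varying comes first.
    for subsample, mcw, colsample, eta in product(
        SUBSAMPLE, MIN_CHILD_WEIGHT, COLSAMPLE_BYTREE, ETA
    ):
        grid.append({**FIXED, "eta": eta, "colsample_bytree": colsample,
                     "min_child_weight": mcw, "subsample": subsample})
    return grid


def best_by_rmse(results):
    """Pick the result row with the lowest validation RMSE."""
    return min(results, key=lambda row: row["RMSE"])


grid = build_grid()
```

In the table above, the lowest validation RMSE (4.914138, with R2 of 0.6937481) belongs to the row with eta = 0.3, colsample_bytree = 0.75, min_child_weight = 1, and subsample = 1.00.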